DeRiK: A German reference corpus of computer-mediated communication

Authors

  • Michael Beißwenger
  • Maria Ermakova
  • Alexander Geyken
  • Lothar Lemnitzer
  • Angelika Storrer
Abstract

The paper describes an ongoing project that aims at building a reference corpus of German computer-mediated communication (CMC) as a new component of an existing reference corpus of contemporary written German. The ‘Deutsches Referenzkorpus zur internetbasierten Kommunikation’ (DeRiK) will include data from the CMC genres most prominent among German Internet users and thus close a gap in the coverage of the corpus resources of the project “Digitales Wörterbuch der deutschen Sprache” (DWDS), which are maintained and provided by the Berlin-Brandenburg Academy of Sciences and the Humanities (BBAW). The paper focuses on the role of the DeRiK component within the DWDS framework, on sampling issues, and on CMC-specific issues of corpus annotation.

1. Project Background and Focus of the Paper

In view of the increasing amount of reading and writing that people do on the Internet, up-to-date corpora of contemporary written language must take into account the impact of computer-mediated communication (CMC) on contemporary language and thus include samples of emerging written genres such as e-mail, weblogs, microblogging on Twitter, discussion boards and wiki discussions, chats and instant messaging conversations, and communication on social network sites. In this paper we present selected aspects of an ongoing project that aims at building a reference corpus of German CMC, called DeRiK (‘Deutsches Referenzkorpus zur internetbasierten Kommunikation’) [1]. DeRiK is a joint initiative of TU Dortmund University and the Berlin-Brandenburg Academy of Sciences and the Humanities (BBAW) and is embedded in the scientific network “Empirical Research on Internet-based Communication” (http://www.empirikom.net/), funded by the Deutsche Forschungsgemeinschaft (DFG).
The corpus will be integrated into the lexical information system provided by the BBAW project “Digitales Wörterbuch der deutschen Sprache” (DWDS, WWW.DWDS.DE) [2]. The focus of this paper is on the role of the DeRiK component within the DWDS framework and on sampling issues (section 2) as well as on CMC-specific issues of corpus annotation (section 3).

2. Integrating CMC Discourse into a Corpus of Contemporary German: Motivation, Sampling, and Application Fields

DWDS (WWW.DWDS.DE) is a lexical information system developed by and hosted at the BBAW. The system offers one-click access to three different types of resources (Geyken, 2007):

  a) lexical resources: a common language dictionary [3], an etymological dictionary, and a thesaurus;
  b) corpus resources: a balanced reference corpus of German (called ‘DWDS core corpus’) ranging from 1900 to the present, a set of additional newspaper corpora, and specialized corpora;
  c) statistical resources for words and word combinations.

These resources are displayed alongside one another in separate panels (see Fig. 1). The system offers a choice among several views, i.e. among several profiles with predefined panel combinations. The CMC component DeRiK (‘Deutsches Referenzkorpus zur internetbasierten Kommunikation’) will be integrated into this framework both as an independent panel and as a subcorpus of the DWDS core corpus. The data for DeRiK will be collected not just once but on a regular basis; DeRiK will thus consist of several partial corpora, each representing data collected during a particular period (e.g., within one year). The sampling of the data is guided by the findings of the “ARD/ZDF-Onlinestudie”, a German online usage survey (WWW.ARDZDF-ONLINESTUDIE.DE) that reports the usage preferences of German Internet users annually, broken down by online application and age group.
[1] http://www.empirikom.net/bin/view/Themen/DeRiK
[2] Another corpus of contemporary language which aims to include a CMC subcorpus is the Dutch SoNaR project (Reynaert et al. 2010).
[3] This dictionary is based on a six-volume paper dictionary, the “Wörterbuch der deutschen Gegenwartssprache” (WDG, English: ‘Dictionary of Contemporary German’), published between 1962 and 1977 and compiled at the Deutsche Akademie der Wissenschaften (Klappenbach/Steinitz (eds.) 1964-1977).

The findings of this survey allow us to derive an ideal key for the composition of the DeRiK partial corpora, i.e. for deciding which CMC technologies have to be regarded as most prominent among German Internet users in any given year and in which proportion discourse conducted with those technologies should be represented in the corpus. For practical reasons, however, the project will collect data only from those instances of the CMC technologies indicated by the online survey for which the users have explicitly granted permission to (re-)distribute and (re-)use their written utterances for non-commercial purposes or academic research (e.g., by assigning the respective subtypes of the “Creative Commons” License to CMC documents or to CMC applications on the web). Thus, the key derived from the findings of the annual online survey describes an ideal compilation (with ideal proportions of the CMC genres), while the legal constraints compel us to implement this ideal key only in modified form. Since the data will be collected over several years, we can adapt our key for each phase of data collection, both to changing usage preferences according to the most recent version of the online survey and to changes in intellectual property rights (IPR) restrictions on the use of CMC data retrieved from the web for scientific purposes.
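The relation between the ideal key and its modified form can be illustrated with a small sketch. It rests on assumptions that the project description does not state: the usage figures are invented placeholders rather than values from the ARD/ZDF-Onlinestudie, the function names `ideal_key` and `modified_key` are hypothetical, and renormalizing the remaining shares is only one possible way of modifying the key when some genres cannot be collected for legal reasons.

```python
from typing import Dict, Set


def ideal_key(usage: Dict[str, float]) -> Dict[str, float]:
    """Normalize survey usage figures into proportions that sum to 1."""
    total = sum(usage.values())
    if total <= 0:
        raise ValueError("usage figures must sum to a positive value")
    return {genre: value / total for genre, value in usage.items()}


def modified_key(ideal: Dict[str, float], licensed: Set[str]) -> Dict[str, float]:
    """Keep only genres with usable licenses and renormalize their shares.

    Renormalization is an assumption made for this sketch, not a rule
    taken from the project description.
    """
    kept = {genre: share for genre, share in ideal.items() if genre in licensed}
    total = sum(kept.values())
    if total == 0:
        return {}
    return {genre: share / total for genre, share in kept.items()}


# Hypothetical usage figures (NOT taken from the ARD/ZDF-Onlinestudie).
usage = {"chat": 30.0, "forum": 20.0, "wiki_talk": 10.0, "social_network": 40.0}
ideal = ideal_key(usage)

# Hypothetical legal situation: no redistributable social network data.
actual = modified_key(ideal, licensed={"chat", "forum", "wiki_talk"})

for genre in usage:
    print(f"{genre:15s} ideal={ideal[genre]:.2f} modified={actual.get(genre, 0.0):.2f}")
```

Because the key is recomputed from its inputs, each phase of data collection could rerun the same calculation with the latest survey figures and the current set of licensed sources.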
The first partial corpus of DeRiK will mostly include discourse from Wikipedia talk pages, a selection of forum and weblog discussions, chat conversations, and postings of selected Twitter users. The integration of the CMC reference corpus into the DWDS system may be valuable for various research and application fields, for example:

  a) Language variation, language change and stylistics: A general-language corpus that includes a CMC component will provide a broad empirical basis (a) for further corpus-based investigations of the usage and dissemination of CMC-specific phenomena across linguistic varieties and digital genres, and (b) for comparative analyses of the features of CMC discourse and of “traditional” written genres (e.g. newspaper, fiction, scientific writing, nonliterary prose); it will thus make it easier to track and describe how new linguistic patterns and communicative genres emerge [4].

  b) Lexicology and lexicography: Besides genre-specific discourse markers and “netspeak” jargon (like ‘lol’ laughing out loud or ‘imho’ in my humble opinion), new vocabulary is characteristic of CMC discourse, e.g. ‘funzen’ (an abbreviated variant of ‘funktionieren’ to function) or ‘gruscheln’ (a function of a German social network platform, most likely a blend of ‘grüßen’ greet and ‘kuscheln’ cuddle). There are also CMC-specific processes of lexical-semantic change, e.g. the broadening of the concept of ‘Freund’ (friend). Up-to-date lexical resources should document and describe these tendencies by integrating CMC data into their data basis. Once the first partial corpora of the DeRiK corpus are made available in the DWDS system, it is intended to extend the DWDS dictionary component with entries describing new lexemes that have evolved from CMC discourse. In addition, the DWDS corpus system will then allow one to track how new vocabulary from CMC discourse (such as the examples mentioned above) spreads into “traditional” genres (e.g. newspaper, fiction, nonliterary prose).

  c) Language teaching: CMC has become an important part of everyday communication. Language- and culture-specific properties of CMC should thus also be taken into consideration in communicative approaches to Second Language Teaching. In this context, the DeRiK corpus and the documentation of CMC vocabulary in the DWDS dictionary may be useful resources. In school teaching, students with German as a native language may use the DWDS system to compare “traditional” written language with CMC and to explore how style varies across different genres.

[4] Overviews of the features of CMC discourse from a linguistic perspective can be found, e.g., in Herring (ed., 1996; 2010), Runkehl et al. (1998), Crystal (2001; 2011), Beißwenger/Storrer (2008), and Storrer (2012).

Fig. 1: Web frontend of the DWDS system (http://www.dwds.de)

3. Annotation of CMC-Specific Phenomena

One advantage of integrating DeRiK into the DWDS system is that users can profit from the DWDS corpus annotation and querying facilities: The corpus resources which are currently available in the DWDS system are lemmatized with the TAGH morphology (Geyken and Hanneforth, 2006) and tagged with the part-of-speech tagger moot (Jurish, 2003). The corpus search engine DDC (Dialing DWDS Concordancer) supports linguistic queries on several annotation levels (word forms, lemmas, STTS part-of-speech categories) as well as filtering (e.g. by text type) and sorting options. Since all corpus resources in the DWDS system are encoded according to the guidelines of the Text Encoding Initiative (TEI-P5), the project also uses TEI for the annotation of its CMC component.
For this purpose, we have developed a TEI-compliant annotation schema that provides

  • a macrostructure of CMC discourse which covers a broad range of CMC genres (see section 3.1);
  • a partial schema for the description of selected CMC-specific phenomena (“interaction signs”: emoticons, interaction words, interaction templates, addressing terms; see section 3.2 for details).

A detailed description of the TEI schema for DeRiK is given in Beißwenger et al. (2012) [5]. The discussion in this paper will focus on two core issues: the representation of CMC-specific micro- and macrostructures (section 3.1) and the annotation of typical “netspeak” elements (section 3.2).

3.1 Annotation of CMC-Specific Micro- and Macrostructures

We introduced the category posting as a basic element to capture CMC micro- and macrostructures. A posting is defined as a content unit that is sent to the server “en bloc”. Postings can usually be recognized by their formal structure, even if they have different forms and structures across CMC genres. This facilitates the automatic segmentation and annotation of CMC micro- and macrostructures. We use the term microstructure to refer to the internal structure of postings. There are cases in which a posting consists of exactly one portion of text. In other CMC genres, e.g. in discussion groups, postings may contain divisions and markup used by the author to structure their content. We use the term macrostructure to describe how the postings are sequenced. While microstructures are generated by an individual author, macrostructures do not emerge from the actions of just one user but from the posting activities of all users involved in a CMC conversation plus server routines for ordering the incoming postings.
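The distinction between microstructure and macrostructure can be made concrete with a minimal data-model sketch. It is not the DeRiK TEI schema: the class `Posting`, its fields, and the two ordering functions are hypothetical names chosen for illustration. The sketch only shows that the internal divisions of a posting belong to its author, whereas the sequence of postings is computed from all postings together, either by server arrival time or by reply relations rendered as indentation depth.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Posting:
    """One content unit sent to the server en bloc (hypothetical model)."""
    posting_id: str
    author: str
    received: int                   # server arrival time, e.g. in seconds
    paragraphs: List[str]           # microstructure: divisions made by the author
    reply_to: Optional[str] = None  # id of the posting this one responds to


def chronological(postings: List[Posting]) -> List[Posting]:
    """Sequence postings by the time they reached the server."""
    return sorted(postings, key=lambda p: p.received)


def threaded(postings: List[Posting]) -> List[Tuple[int, Posting]]:
    """Sequence postings as (indentation depth, posting) pairs.

    Postings whose reply_to target is missing are not reached by this
    traversal; a real pipeline would need a policy for such cases.
    """
    children: Dict[Optional[str], List[Posting]] = {}
    for p in chronological(postings):
        children.setdefault(p.reply_to, []).append(p)

    ordered: List[Tuple[int, Posting]] = []

    def visit(parent: Optional[str], depth: int) -> None:
        for p in children.get(parent, []):
            ordered.append((depth, p))
            visit(p.posting_id, depth + 1)

    visit(None, 0)
    return ordered


postings = [
    Posting("p1", "anna", 10, ["Opening question."]),
    Posting("p2", "ben", 20, ["First answer.", "With a second paragraph."], reply_to="p1"),
    Posting("p3", "cem", 30, ["New topic."]),
    Posting("p4", "anna", 40, ["Follow-up to Ben."], reply_to="p2"),
]

for depth, p in threaded(postings):
    print("  " * depth + f"{p.posting_id} ({p.author})")
```

In this toy conversation the chronological order is p1, p2, p3, p4, while the threaded order places p4 directly under p2, which shows that the same set of postings can yield different macrostructures depending on the ordering routine.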
Our TEI schema makes a distinction between two major types of CMC macrostructures:

  • logfile structures, which arrange the postings in a linear chronological order based on when they reached the server (as is the case in chats and instant messaging data);
  • thread structures, which arrange the postings using two dimensions with specific semantics: the above/below dimension representing a temporal “before/after” relation; the left/right dimension (by indentation), which usually symbolizes the topical affiliation of one posting to a previous posting (as is the case, e.g., in forum, weblog, and wiki discussions).

[5] The RNG schema file, a TEI-compliant ODD documentation as well as encoding examples are available at http://www.empirikom.net/bin/view/Themen/CmcTEI.

3.2 Annotation of Interaction Signs

The corpus-based investigation of “netspeak” jargon is of interest in many research contexts (style variation and language change, discourse management, language teaching, etc.). Our annotation schema comprises elements for a set of “netspeak” phenomena which we term “interaction signs”. The term builds on the category “interaktive Einheiten”, which was introduced in the three-volume scientific grammar of the German language (Zifonun et al., 1997) to classify interjections (such as “hm” or “oh my god”) and responsives (such as “yes” and “no”) in spoken discourse. In contrast to part-of-speech categories, interaction signs are not syntactically integrated and do not contribute to the compositional structure of sentences. In spoken discourse, they serve as devices for conversation management, i.e. they can be used to express reactions to the partners’ utterances or to display the speaker’s emotions. Besides interjections and responsives, the category “interaction sign” includes four CMC-specific subcategories (see Fig. 2):

1. Emoticons, which are iconic units that are created with the keyboard and which typically serve as emotion or irony markers or as responsives. Being of iconic origin, emoticons are not restricted to a specific language. However, different styles of emoticons exist, e.g. Western style emoticons such as :-), :-(, ;-), or :), and Japanese style emoticons such as (^_^), \(^_^)/, (*_*).

2. Interaction words, which are symbolic linguistic units whose morphological construction is based on a word or a phrase. They may describe gestures or facial expressions, e.g. *g* (< “grins” grin), *fg* (< fat grin), *s* (< smile), or they are used for the simulation of actions and events.

3. Interaction templates, which are units that the user does not generate with the keyboard but which are generated automatically from a file with a previously prepared text or graphical element after the user has activated a template.

4. Addressing terms, which are units that are used to address an utterance to a particular interlocutor.
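As a rough illustration of how some of these subcategories might be pre-annotated automatically, the following sketch classifies whitespace-separated tokens with regular expressions. It is an assumption-laden toy, not the project's annotation procedure: the patterns cover only the emoticon and interaction-word examples listed above, the “@name” convention for addressing terms is assumed rather than taken from the schema, the label names are hypothetical, and interaction templates are left out because they cannot be recognized from typed text alone.

```python
import re
from typing import List, Optional, Tuple

# Each entry pairs a (hypothetical) label with a pattern for whole tokens.
PATTERNS = [
    ("emoticon", re.compile(r"[:;]-?[()]")),             # :-) :-( ;-) :)
    ("emoticon", re.compile(r"\\?\([\^*]_[\^*]\)/?")),   # (^_^) \(^_^)/ (*_*)
    ("interaction_word", re.compile(r"\*[A-Za-z]+\*")),  # *g* *fg* *s*
    ("addressing_term", re.compile(r"@\w+")),            # assumed "@name" convention
]


def classify(token: str) -> Optional[str]:
    """Return the label of the first pattern matching the whole token."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(token):
            return label
    return None


def tag(posting: str) -> List[Tuple[str, Optional[str]]]:
    """Split a posting on whitespace and label each token (None = other)."""
    return [(token, classify(token)) for token in posting.split()]


for token, label in tag("@Anna das funzt jetzt :-) *g*"):
    print(f"{token:8s} {label}")
```

Whitespace tokenization is a deliberate simplification: an emoticon typed directly after a word without a space would be missed, so a realistic tokenizer would have to treat interaction signs as units of their own.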




Journal: LLC

Volume: 28

Publication date: 2013